Background

In this course we will explain the theory behind common statistical tests to compare two or more groups. We will illustrate this using biologically-motivated examples and R code that uses the “tidyverse” techniques of data manipulation and visualization.

To install the entire collection of tidyverse packages (which includes other useful data-related packages that we will not use today), you may wish to use the following:-

## Make take some time!
install.packages("tidyverse")

Statistics has been a fundamental part of the R language from the beginning. A vast array of statistical tests and associated visualizations are supported. However, the functions to perform these predate the creation of the tidyverse and are used in a manner that might be somewhat counter-intuitive to someone that is familiar with the tidyverse. Therefore we will use a couple of add-on packages that have been created to perform statistics in a manner that is compatible with the tidyverse eco-system.

The specific set of required packages is as follows:-

install.packages(c("dplyr",
                   "ggplot2",
                   "readr",
                   "readxl",
                   "rstatix",
                   "ggpubr",
                   "rmarkdown")) 

Part I - Contingency tables

When working with categorical variables, we are usually interested in the frequencies of the different categories in our sample. To display data for two or more categorical variables, cross-tabulations, or contingency tables, are commonly used - with 2 x 2 tables being the simplest. We can then test whether there is an association between the row factor and the column factor by a chi-squared test or a Fisher’s exact test.

To demonstrate the analysis of contingency tables we will use a dataset provided with the vcd package. You will need to install this package using the install.packages R function.

install.packages("vcd")
library(dplyr)
library(ggplot2)
library(rstatix)
library(ggpubr)

The data frame Arthritis should then be accessible which is described as:-

Data from Koch & Edwards (1988) from a double-blind clinical trial investigating a new treatment for rheumatoid arthritis.

#let's use the'Arthritis' dataset in the 'vcd' package 
library(vcd)
Arthritis

The count function, included in dplyr can be used to give a tabulation of the values in any column in the data frame.

##one way table (count of one variable)
count(Arthritis,Sex)

We can quickly visualise these counts as a barplot. A pie chart is possible, but not generally recommended.

ggplot(Arthritis, aes(x = Sex)) + geom_bar()

The function freq_table is another way of making the counts and will also add the proportions as an extra column.

#to get proportion of males and females
freq_table(Arthritis,Sex) 
NA

The count function can also compare two columns from a data frame if both columns are given as arguments.

count(Arthritis,Sex,Improved)

These can also be plotted

count(Arthritis,Sex,Improved) %>% 
ggplot(aes(x = Sex, fill = Improved, y = n)) + geom_col()

count(Arthritis,Sex,Improved) %>% 
ggplot(aes(x = Sex, fill = Improved, y = n)) + geom_col(position = "dodge")

The freq_table function can be used to calculate the proportions. However, we need to pay attention to the order in which the variables are specified as this will dictate how the proportions are calculated

Compare the output of:-

freq_table(Arthritis,Sex, Improved)

to:-

freq_table(Arthritis,Improved, Sex)

Having explored our data, we can now perform statistical testing. The function chisq_test can be used to assess whether differences in proportions are significant or not. We actually don’t need to calculate the proportions; R will do this for us.

However, we need to re-format the data slightly into a wide table rather than the default long nature of a data frame in the tidyverse. We the pivot_wider function we create a two-by-two table that is typically used for a contingency analysis.

Arthritis %>% 
  count(Improved, Treatment) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved)
NA

And now remove the Treatment column as the chisq_test function is only expecting numeric data.

The results are presented in a tidyverse tibble that and we can interpret the test statistics and p-value from this table

Arthritis %>% 
  count(Improved, Treatment) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved) %>% 
select(-Treatment)  %>% 
  chisq_test()

However, the chisq_test function is not appropriate in all circumstances.

Arthritis %>% 
  count(Improved, Sex) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved) %>% 
  select(-Sex)  %>% 
  chisq_test
Warning: Chi-squared approximation may be incorrect

The Fisher test is recommended for tables with low numbers of observations (e.g. when more than 20% of cells have expected frequencies < 5)

Arthritis %>% 
  count(Improved, Sex) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved) %>% 
 select(-Sex)  %>% 
  fisher_test
NA

Exercise

1- Read the excel file called Ex Biostat P1.xlsx into R (see below for the required code)

2- Use the counts function to make a cross-tabulation of Tumour grade and Gender

3- Determine the percentage of Grade III tumors within females

4- Use the appropriate test to check if the tumor grade depends on the gender

## the readxl package is required to read xls and xlsx files into R
## However, csv and tsv files are recommended to store data

tab <- readxl::read_xlsx("data/EX Biostat P1.xlsx")

Part II - How to assess normality

We will read some example data to illustrate how one would test for a normally-distributed variable. This property is important as it influences which test we should use.

One of the best ways of displaying data is by using a graph. Graphs can make both simple and complex data easier to understand by making it easier to spot trends and patterns. We can use plots to view the distribution of our data (minimum, maximum, mid-point, spread etc) and to ensure that the values in our dataset seem realistic (e.g. no outliers). Many statistical tests rely on the assumption that the data are normally distributed.

The data for this section are to be found in the file normal_example.csv in the data folder. You will need to specify the file path accordingly.

library(readr)
df1 <- read_csv("data/normal_example.csv")

We can inspect the data in RStudio and discover that it consists of a tidy dataset with numeric values in a column called Values and a column Var to indicate a variable name (x)

View(df1)

Various graphical methods are available to assess the distribution. The first of which is a histogram. Here, the data are split into “bins” (ggplot2 chooses the number of bins) and the value on the y axis corresponds to the number of observations in that bin. The user only has to specify the variable to be plotted, and ggplot2 takes care of the binning. From this plot we can judge what the average value of the data is, and the spread.

Histograms can be made in ggplot2 by using the geom_hist function. Here we are going to make use of the gghistogram function from ggpubr.

gghistogram(df1,x="Value")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

A similar option is to produce a density curve with ggdensity (also from ggpubr). Here the y-axis is the density of a particular value.

ggdensity(df1, x="Value")

Combining Histograms and Density

When assessing the distribution of a variable, you might be tempted to plot the histogram and density on the same plot. gghistogram has the argument add_density, which should do the job.

gghistogram(df1,x="Value",add_density = TRUE)
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

However, this doesn’t work. If you check the y-axis limits on the histogram and density plots, you’ll notice they are on a different scale. Conveniently there is an argument in gghistogram that solves this issue.

gghistogram(df1,x="Value",
            add_density = TRUE,
            y="..density..")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

We can add the standard normal curve with the following:-

gghistogram(df1,x="Value",add_density = TRUE,y="..density..") + stat_overlay_normal_density(col="steelblue",lwd=2,lty=2)
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

A box plot is an excellent way of displaying continuous data when you are interested in the spread of your data. The box of the box plot corresponds to the lower and upper quartiles of the respective observations and the bar within the box, the median. The whiskers of the box plot correspond to the distance between the lower/upper quartile and the smaller of: the smaller/largest measurement OR 1.5 times the inter quartile range. A disadvantage of the box plot is that you don’t see the exact data points. However, box plots are very useful in large datasets where plotting all of the data may give an unclear picture of the shape of your data.

## setting x to be empty is required to make a boxplot from one variable
ggboxplot(df1, y = "Value")

A violin plot is sometimes used in conjunction with the boxplot to show density information.

ggviolin(df1, y = "Var")

Individual points can also be added with geom_jitter; avoiding over-plotting by adding random noise along the x-axis.

ggviolin(df1, y = "Value",add = "jitter")

Finally, we have a “qq-plot” which allows to compare the quantiles of our dataset against a theoretical normal distribution. If the majority of points lie on a diagonal line then the data are approximately normal.

ggqqplot(df1, x="Value")

These graphical methods are by far the easiest way to assess if a given dataset is normally-distributed.

For “real-life” data, the results are unlikely to give a perfect plot, so some degree of judgement and prior experience with the data type are required. Indeed, it should be noted that the dataset visualised in the above plots was sampled from a normal distribution. Even then, the plots were not 100% convincing!

Tests for normality

Although their usage is contentious amongst statisticians, there are a few methods for testing whether variables are normally-distributed or not. If the p-value is sufficiently small then we conclude that the data are not normally distributed. However, some statisticians prefer to use graphical methods and their intuition about the data or prior knowledge of the data type (e.g. some measures are generally believed to be normally-distributed)


#shapiro test
#p<0.05 ...difference between data and normality..data not normal
#p>0.05 ...no diff between data and normality ..data normally distributed

df1 %>% 
shapiro_test(Value) %>% 
  mutate(p < 0.05)
NA

Descriptive Statistics

In the accompanying R course we have seen how to produce summary statistics of columns in a dataset. For a dataset that is normally-distributed, appropriate measures of the average and variability are the mean and standard deviation. Both these functions are available within R and can be used in conjunction with the summarise function in dplyr.

df1 %>% 
  get_summary_stats 
df1 %>% 
  get_summary_stats %>% 
  select(mean, sd)

Exercise

1- Read the excel file called Ex Biostat P2.xlsx into R

2- Identify if the age and hospitalization days are normally distributed

3- Use the appropriate descriptive statistics [mean and SD, or median and IQR] for each variable

Part III - Significance tests for continuous variables

In this part we will show how to perform tests to compare 1, 2 (or more) continuous variables. The dataset, provided by MASH at The University of Sheffield, describes individuals that have been following different diets and their age and gender. The main goal of interest is to determine which of three competing diet regimes results in the greatest weight loss. However, we can use the dataset to demonstrate other types of test.

diet <- read_csv("data/diet.csv")
diet

One-sample test

The first hypothesis we will test is whether the people in the study are overweight or not. This first involves some manipulation of the table to calculate an extra variable; the Body Mass Index (BMI). We will test if people in our study are overweight, where overweight is defined as having a BMI over 25.

\(BMI = weight / height^2\)

(where the weight is measured in kg, and the height in metres)

Exercise

  • Add a new variable to the data frame for the BMI of each person
    • you might want to do this in multiple steps using the %>% notation

We can now test our new variable for normality using the plots and tests from earlier

gghistogram(diet, x = "BMI", ,add_density = TRUE,y="..density..")+ stat_overlay_normal_density(col="steelblue")

The one-sample t-test is implemented in the function t_test (as are the various types of t-test that we will see). The variables to be used in the test are defined using R’s formula ~ syntax.

Y ~ X

Where Y is a numeric variable and X is a categorical variable indicating the

In the case of a one-sample test we are testing one numeric variable against a know or population mean. This is written as:-

Y ~ 1

Note that this not the same as testing against a population mean of 1! t_test will provide us with a t test statistic and a p-value. However, it is up to us to interpret the p-value according. There is no guarentee that the test we are conducting is appropriate.

diet %>% 
  t_test(BMI ~1)

We get a hugely significant result! However, if we look at the description for t_test it is testing against a population mean of \(0\). It is no surprise that we get a significant result! By changing the mu argument we can perform a test to see if the people in the study are overweight to begin with (using 25 as the population or known mean).

diet %>% 
  t_test(BMI ~1,mu = 25,alternative = "greater")
## 25 is the cutoff for overweight
#t.test(diet$BMI, mu=25,alternative = "greater")

Two-sample tests

A two-sample t-test should be used if you want to compare the measurements of two populations. There are two types of two-sample t-test: independent (unpaired) and paired (dependent). To make the correct choice, you need to understand your underlying data.

  • An independent two-sample t-test is used when the two samples are independent of each other, e.g. comparing the mean response of two groups of patients on treatment vs. control in a clinical trial.

  • As the name suggests,a paired two-sample t-test is used when the two samples are paired, e.g. comparing the mean blood pressure of patients before and after treatment (two measurements per patient).

Back to our dataset, we might wonder if there is actually any effect due to diet so we will compare the intial and final weights.

diet %>% 
  select(contains("weight")) 

We will not go through all the plotting options from before, but here is the histogram and qqplot for the initial weight.

diet %>% 
  select(contains("weight")) %>% 
  gghistogram(x = "initial.weight", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="red")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

diet %>% 
  select(contains("weight")) %>% 
  ggqqplot(x = "initial.weight")

Exercise: Do you think the initial weight variable is normally-distributed? Vote on wooclap now!

Re-formating the dataset for tidy analysis

The dataset in it’s current form is not suitable for analysis with tidy methods. Before proceeding we need to re-format the data into two columns.

  • One column containing numeric variables (each value of weight)
  • An indicator (or categorical) variable to denote that group each numeric observation belongs to (initial or final weight)

The pivot_longer function supports this transformation. As we want to use all columns in our existing dataset we have to use the everything() shortcut to select the columns. The names of our new columns can be chosen, but we will chose Y and X to be consistent with the formula notation introduced earlier.

diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") 

So now Y contains all our observations of weight and X indicates when the weights were measured. The gghistogram function we introduced earlier is able to visualise data in this form, and we can use the facet.by argument to produce separate plots for each type of weight measurement.

diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  gghistogram(x = "Y", facet.by = "X",add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="red")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

Similarly, introducing a group_by function will allow us to create summaries for each group separately.

diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  group_by(X) %>% 
  get_summary_stats()
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  group_by(X) %>% 
  shapiro_test(Y)

The t_test function requires us to create a formula

diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  t_test(Y ~ X)
## Just to prove that we get the same result as base R
## Will remove before the course
t.test(diet$final.weight, diet$initial.weight)

    Welch Two Sample t-test

data:  diet$final.weight and diet$initial.weight
t = -3.0342, df = 149.98, p-value = 0.002843
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.515747 -1.376358
sample estimates:
mean of x mean of y 
 68.34342  72.28947 

The p-value is significant and shows that overall the weights of individuals is different before and after diet. However, this test is not specifically testing for a decrease after the diet (which we would really hope to be the case). By adding an extra argument alternative=less we get a different result

diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  t_test(Y ~ X,alternative = "less")
t.test(diet$final.weight, diet$initial.weight,alternative = "less")

    Welch Two Sample t-test

data:  diet$final.weight and diet$initial.weight
t = -3.0342, df = 149.98, p-value = 0.001422
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -1.793602
sample estimates:
mean of x mean of y 
 68.34342  72.28947 

Easy way of showing stats result on the plot is using stat_compare_means

diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Weight",names_to = "Time") %>% 
  ggplot(aes(x = Time, y = Weight)) + geom_boxplot() + geom_jitter(width=0.1) + stat_compare_means(method = "t.test",method.args = list(alternative = "less"))

There is also extra information that we could employ; namely that the measurements of weight are made on the same person before and after dieting. This is a classic example of when to apply a paired t-test. Again, we do not need to use a different function to perform the test; only add an argument paired=TRUE to t_test.

diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Weight",names_to = "Time") %>% 
  t_test(Weight ~ Time,alternative = "less",paired=TRUE)
t.test(diet$final.weight, diet$initial.weight,alternative = "less", paired=TRUE)

    Paired t-test

data:  diet$final.weight and diet$initial.weight
t = -13.728, df = 75, p-value < 2.2e-16
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
     -Inf -3.46735
sample estimates:
mean difference 
      -3.946053 

A convenient plot in ggpubr will allow us to visualise the paired differences

ggpaired(diet, cond1="initial.weight", cond2="final.weight",line.color = "grey") + stat_compare_means(paired=TRUE,method = "t.test")

The Independent t test, with two independent groups

Lets consider that we want to compare whether males or females lost more weight during the trial. Here we have two groups and these can be treated as independant variables as different patients belong to the two groups.

The null hypothesis for such a test would be that the weight loss is the same between groups male and female. We seek to evidence to reject this hypothesis by calculating a test statistic. Firstly, we have to check for normality:-

diet %>% 
mutate(weight.loss = initial.weight - final.weight) %>% 
ggdensity("weight.loss",color = "gender") + stat_overlay_normal_density()


diet %>%  
  mutate(weight.loss = initial.weight - final.weight) %>% 
  group_by(gender) %>% 
  shapiro_test(weight.loss)
NA

We conclude that the variances are approximately the same and both variables are normally-distributed

We can use the t_test function to perform an independant test. As before, the first argument to t_test is the R formula notation for the test being performed. However, this time we do not need the pivot_longer function

This function allows various type of test to be performed by changing the appropriate arguments (see the help for t.test for details (?t.test)). For instance, we can tell the test that we believe our variances are equal or not.

diet %>% 
  mutate(weight.loss = initial.weight - final.weight) %>% 
  t_test(weight.loss ~ gender)

We can see that the t-statistic we observe is consistent with the null hypothesis, that the weight loss in males and females is the same. That is, the probability of observing a t-statistic of 0.2 or more, or -0.2 or less, is quite high.

This is not a significant result (p>0.05), so there is no evidence of a difference in weight loss between males and females

diet %>% 
  mutate(weight.loss = initial.weight - final.weight) %>% 
  ggplot(aes(x = gender, y=weight.loss)) + geom_boxplot() + stat_compare_means(method="t.test")

Non- Parametric alternatives (e.g. the Wilcoxon test)

Being able to use the t_test relies on the your data being normally-distributed. If we do not sufficient confidence in this assumption, there are different statistical tests that can be applied. Rather than calculating and comparing the means and variances of different groups they are rank-based methods. However, they still come with a set of assumptions and involve the generation of test statistics and p-values.

Independent samples = Wilcoxon rank sum test (Mann Whitney U test)

This test has many different names including the Wilcoxon, Wilcoxon two sample test, Wilcoxon-Mann-Whitney, Wilcoxon rank sum and the Mann-Whitney-U test. However, this test should not be confused with the Wilcoxon signed rank test (which is used for paired tests). To avoid confusion this test is usually referred to as the Mann-Whitney U test, which is used when the dependent variable to be examined is continuous but the assumptions for parametric tests are violated.

The assumptions of the Mann-Whitney U are as follows:

1.The dependent variable is ordinal or continuous.

2.The data consist of a randomly selected sample of independent observations from two independent groups.

3.The dependent variables for the two independent groups share a similar shape.

Fortunately, the R developers have made the function to do a Wilcox-test similar to doing a t-test. The difficulty is in choosing the correct test to apply - which R will not advise you on.

Let’s go back to our example of comparing weight loss between groups male and female. The equivalent non-parametric version of the test we performed before is:-

diet %>% 
  mutate(weight.loss = initial.weight - final.weight) %>% 
  wilcox_test(weight.loss ~ gender)

The wilcox_test is flexible in much the same way that t_test in. We can switch to applying a paired test by adding the argument paired=TRUE.

diet %>% 
  select(initial.weight,final.weight) %>% 
  tidyr::pivot_longer(everything(),values_to = "Weight",names_to = "Time") %>% 
  wilcox_test(Weight ~ Time, paired=TRUE)

Compare between more than two groups

Parametric (ANOVA)

The two-sample t-test is useful when we have just two groups of continuous data to compare. When we want to compare more than two groups, a one-way ANOVA can be used to simultaneously compare all groups, rather than carrying out several individual two-sample t-tests. The main advantage of doing this is that it reduces the number of tests being carried out, meaning that the type I error rate (the probability of seeing a significant result just by chance) does not become inflated.

In order to justify if an ANOVA test is appropriate we have to test for normality.

diet <- mutate(diet, weight.loss = initial.weight - final.weight)

diet %>%   
gghistogram(x = "weight.loss", facet.by = "diet.type", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="red")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

group_by(diet, diet.type) %>% 
  shapiro_test(weight.loss)


diet %>% 
  ggqqplot(x = "weight.loss", facet.by = "diet.type")

A one-way ANOVA compares group means by partitioning the variation in the data into between group variance and within group variance. Like the other statistical tests we have encountered, the functions in R do the hard work of calculating the statistics. The anova_test function is a tidy version of the ANOVA test.

diet %>% 
  anova_test(weight.loss ~ diet.type)
ANOVA Table (type II tests)

     Effect DFn DFd     F     p p<.05   ges
1 diet.type   2  73 5.383 0.007     * 0.129

When the test provides a significant result (like above) it tells us that there is at least on difference in the groups. However, it does not tell us which group is different. For this, we can apply a “post-hoc test” such as the Tukey test. If anova_test did not produce a significant p-value, we wouldn’t proceed with this step

diet %>%  
  tukey_hsd(weight.loss ~ diet.type)

As we have seen previously, a standard method of presenting the differences between groups is to use the stat_compare_means function to automatically add p-values to a boxplot or violin plot

ggplot(diet, aes(x = diet.type, y = weight.loss)) + geom_violin() + geom_jitter(width=0.1) + stat_compare_means(method="anova")

However, in the case of more than two groups it will only show a single p-value from the ANOVA rather than individual comparisons. We can explicitly list particular contrasts we are interested in.

my_comparisons <- list( c("A", "B"), c("A", "C"), c("B", "C") )

ggplot(diet, aes(x = diet.type, y = weight.loss)) + geom_violin() + geom_jitter() +  stat_compare_means(method = "t.test",comparisons = my_comparisons)

Alternatively, we can manually-compute the p-values and add these to the plot.

stat_res <- diet %>% 
  tukey_hsd(weight.loss ~ diet.type)

ggplot(diet, aes(x = diet.type, y = weight.loss)) + geom_violin() + stat_pvalue_manual(stat_res, label = "p.adj",y.position = c(11, 13, 15))

Non-Parametric (Kruskal Wallis)

Data that do not meet the assumptions of ANOVA (e.g. normality) can be tested using a non-parametric alternative. The Kruskal-Wallis test is derived from the one-way ANOVA, but uses ranks rather than actual observations. It is also the extension of the Mann-Whitney U test to greater than two groups.

diet %>% 
  kruskal_test(weight.loss ~ diet.type)
kruskal.test(weight.loss ~ as.factor(diet.type), data=diet)

    Kruskal-Wallis rank sum test

data:  weight.loss by as.factor(diet.type)
Kruskal-Wallis chi-squared = 9.4159, df = 2, p-value =
0.009023

Like the one-way ANOVA this will only tell us that at least one group is different and not specifically which group(s). The post-hoc dunn.test is recommended which also performs multiple testing correction.

diet %>% 
  dunn_test(weight.loss ~ diet.type)

At this point we could be about to recommend diet C to those that wish to lose weight. However, are there any other factors in the data that we should be considering? With ggplot2 we can quite easily visualise the effects of multiple factors on the data. Lets add both gender and diet type into the plot. It now appears that diet C is having an effect on males but not females.

ggplot(diet, aes(x = diet.type, y = weight.loss,fill=gender)) + geom_boxplot()

Two-way ANOVA

The formula notation allows us to specify an interaction between gender and diet type. In other words, we are looking to see if the effect of diet type is different for males and females. In R, the formula for an interaction is specified using a * between the variables that we are interested in assessing the interaction for.

diet %>% 
  anova_test(weight.loss ~ diet.type*gender)
ANOVA Table (type II tests)

            Effect DFn DFd     F     p p<.05      ges
1        diet.type   2  70 5.619 0.005     * 0.138000
2           gender   1  70 0.031 0.860       0.000448
3 diet.type:gender   2  70 3.153 0.049     * 0.083000

This tells us that an effect exists between diet type and gender, but like before we have to run a post-hoc test to discover more

diet %>% 
  tukey_hsd(weight.loss ~ diet.type*gender)

Repeated Measures

CHECK THIS CODE DOES WHAT WE WANT

We will read a modified version of the diet dataset in order to test a “repeated measures” analysis. Here we have added a midpoint weight measurement.

diet2 <- read_csv("data/diet2.csv")
View(diet2)

However, the three measures that we want to compare are given as columns in the data frame. We cannot express this using the R ~ notation. In other words the data is in wide format and not long. We can change this using the tidyr package. This creates a variable for the type of measurement (initial, mid or final) and a variable containing the corresponding value. The default column names are name and value respectively.

### With "::" you can use a function from a package you haven't loaded yet

diet_melt <- diet %>% select(id,gender, contains("weight")) %>% 
  tidyr::pivot_longer(3:5, values_to = "value", names_to = "time_point") %>% 
  mutate(id = as.factor(id))
anova_test(diet_melt, dv =value, 
           wid = id, 
           within = time_point)
ANOVA Table (type III tests)

$ANOVA
      Effect DFn DFd        F         p p<.05   ges
1 time_point   2 150 4473.166 1.96e-134     * 0.957

$`Mauchly's Test for Sphericity`
      Effect    W        p p<.05
1 time_point 0.21 7.66e-26     *

$`Sphericity Corrections`
      Effect   GGe      DF[GG]    p[GG] p[GG]<.05   HFe
1 time_point 0.559 1.12, 83.78 2.55e-76         * 0.561
       DF[HF]    p[HF] p[HF]<.05
1 1.12, 84.15 1.21e-76         *

Exercise

The excel file ‘RCC2’ contains data about the expression levels of some genes in patients with renal cell carcinoma. In your study, you put the following hypotheses. Please test those alternative hypotheses and state whether you will accept or reject each one.

  1. Females have a higher level of E2F3 than males
  2. ANXA expression levels vary between unilateral and bilateral tumors
  3. Individuals with RCC grade II have different levels of E2F3 than those with grade III or IV
  4. The mean value of miR499 decreases significantly after treatment
  5. DFFA is higher in patients with grade IV tumors

Solutions

To be revealed during the workshop!

Exercise 1

###OLD

##1-Read the excel file called 'EX Biostat P1' into R 
Biostat1 <- readxl::read_xlsx("data/EX Biostat P1.xlsx")

##2-Make a cross table showing the gender in the rows and tumor grade in the columns
GenderGrade <- table(Biostat1$Gender, Biostat1$`Tumor grade`)
GenderGrade
        
          I II III
  Female 29 12  30
  Male   18 34   7
##3-Define the percentage of Grade III tumors within females 
prop.table(GenderGrade,1)*100 #(42.3%)
        
                I       II      III
  Female 40.84507 16.90141 42.25352
  Male   30.50847 57.62712 11.86441
##4-Add a column in the table showing the sum of the 3 grades in males and in females and state the total number of males and females in the sample 
addmargins(GenderGrade,2) #(71 Females, 59 males)
        
          I II III Sum
  Female 29 12  30  71
  Male   18 34   7  59
##5-Use the appropriate test to check if the tumor grade depends on the gender 
chisq.test(GenderGrade) #grade depends on gender (significant)

    Pearson's Chi-squared test

data:  GenderGrade
X-squared = 26.512, df = 2, p-value = 1.75e-06
## NEW

Biostat1 <- readxl::read_xlsx("data/EX Biostat P1.xlsx")
##2-Make a cross table showing the gender in the rows and tumor grade in the columns

count(Biostat1, Gender, `Tumor grade`)


##3-Define the percentage of Grade III tumors within females 
freq_table(Biostat1, Gender, `Tumor grade`)

##4-Use the appropriate test to check if the tumor grade depends on the gender 

count(Biostat1, Gender, `Tumor grade`) %>% 
    tidyr::pivot_wider(everything(), names_from = "Gender",values_from = "n") %>% 
  select(-`Tumor grade`) %>% 
  chisq_test()
Warning: Specifying the `id_cols` argument by position was deprecated in tidyr 1.3.0.
Please explicitly name `id_cols`, like `id_cols = everything()`.

Exercise 2


## OLD

##1-Read the excel file called 'EX Biostat P2' into R 
Biostat2 <- readxl::read_xlsx("data/EX Biostat P2.xlsx")

hist(Biostat2$Age,freq = FALSE)
lines(density(Biostat2$Age))


##2-Identify if the age and hospitalization days are normally distributed
library(ggplot2)
ggplot(Biostat2, aes(x=Age)) + geom_histogram(aes(y=..density..,),binwidth = 2) + geom_density()

ggplot(Biostat2, aes(x="",y=Age))  + geom_violin() + geom_boxplot()

ggplot(Biostat2,aes(sample=Age)) + geom_qq() + geom_qq_line()

shapiro.test(Biostat2$Age)

    Shapiro-Wilk normality test

data:  Biostat2$Age
W = 0.97405, p-value = 0.6923
# Age normally distributed


ggplot(Biostat2, aes(x=`hospitalization days`)) + geom_histogram(aes(y=..density..)) + geom_density()

ggplot(Biostat2, aes(x="",y=`hospitalization days`))  + geom_violin() + geom_boxplot()

ggplot(Biostat2,aes(sample=`hospitalization days`)) + geom_qq() + geom_qq_line()

shapiro.test(Biostat2$`hospitalization days`)

    Shapiro-Wilk normality test

data:  Biostat2$`hospitalization days`
W = 0.89026, p-value = 0.006817
# hospitalization days NOT normally distributed

##3-Use the appropriate descriptive statistics [mean+-SD or median (IQ range)] for each variable
library(tidyverse)
Warning: package ‘tidyr’ was built under R version 4.3.2
summarise(Biostat2, MeanAge = mean(Age), sdAge = sd(Age), 
          medianHospDays=median(`hospitalization days`), 
          Q1=quantile(`hospitalization days`, 0.25), 
          Q3=quantile(`hospitalization days`, 0.75))
#Age: 9.54 +/- 2.89 [mean +- SD]  & Hospitalization days: 11.5 (9.00 - 14.25) [median(IQ range)]
###NEW

##1-Read the excel file called 'EX Biostat P2' into R 
Biostat2 <- readxl::read_xlsx("data/EX Biostat P2.xlsx")

gghistogram(Biostat2, x = "Age")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

gghistogram(Biostat2, x = "Age", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="blue")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

ggqqplot(Biostat2, x = "Age")

Biostat2 %>% 
  shapiro_test(vars = "Age")
# Age is normally distributed

gghistogram(Biostat2, x = "hospitalization days")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

gghistogram(Biostat2, x = "hospitalization days", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="blue")
Warning: Using `bins = 30` by default. Pick better value with the argument `bins`.

ggqqplot(Biostat2, x = "hospitalization days")

Biostat2 %>% 
  shapiro_test(vars = "hospitalization days")

# hospitalization days NOT normally distributed

get_summary_stats(Biostat2)
NA
---
title: "Statistical Analysis of Biological Data"
author: "Mark Dunning, Niamh Errington and Aya Elwazir"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,message=FALSE)
```

# Background

In this course we will explain the theory behind common statistical tests to compare two or more groups. We will illustrate this using biologically-motivated examples and R code that uses the "tidyverse" techniques of data manipulation and visualization.

To install the entire collection of tidyverse packages (which includes other useful data-related packages that we will not use today), you may wish to use the following:-

```{r eval=FALSE}
## Make take some time!
install.packages("tidyverse")
```

Statistics has been a fundamental part of the R language from the beginning. A vast array of statistical tests and associated visualizations are supported. However, the functions to perform these predate the creation of the `tidyverse` and are used in a manner that might be somewhat counter-intuitive to someone that is familiar with the `tidyverse`. Therefore we will use a couple of add-on packages that have been created to perform statistics in a manner that is compatible with the tidyverse eco-system.

- [rstatix: Statistical testing for the tidyverse](https://rpkgs.datanovia.com/rstatix/)
- [ggpubr: ‘ggplot2’ Based Publication Ready Plots](https://rpkgs.datanovia.com/ggpubr/)


The specific set of required packages is as follows:-

```{r eval=FALSE}
install.packages(c("dplyr",
                   "ggplot2",
                   "readr",
                   "readxl",
                   "rstatix",
                   "ggpubr",
                   "rmarkdown")) 
```



# Part I - Contingency tables

When working with categorical variables, we are usually interested in the frequencies of the different categories in our sample. To display data for two or more categorical variables, cross-tabulations, or contingency tables, are commonly used - with 2 x 2 tables being the simplest. We can then test whether there is an association between the row factor and the column factor by a *chi-squared test* or a *Fisher’s exact test*.

To demonstrate the analysis of contingency tables we will use a dataset provided with the `vcd` package. You will need to install this package using the `install.packages` R function.


```{r eval=FALSE}
install.packages("vcd")
```


```{r}
library(dplyr)
library(ggplot2)
library(rstatix)
library(ggpubr)
```


The data frame `Arthritis` should then be accessible which is described as:- 

> Data from Koch & Edwards (1988) from a double-blind clinical trial investigating a new treatment for rheumatoid arthritis.

```{r}
#let's use the'Arthritis' dataset in the 'vcd' package 
library(vcd)
Arthritis
```

The `count` function, included in `dplyr` can be used to give a tabulation of the values in any column in the data frame.

```{r message=FALSE}
##one way table (count of one variable)
count(Arthritis,Sex)
```

We can quickly visualise these counts as a *barplot*. A pie chart is possible, but not generally recommended.

```{r}
ggplot(Arthritis, aes(x = Sex)) + geom_bar()
```

The function `freq_table` is another way of making the counts and will also add the proportions as an extra column.

```{r}
#to get proportion of males and females
freq_table(Arthritis,Sex) 

```

The `count` function can also compare two columns from a data frame if both columns are given as arguments.

```{r}
count(Arthritis,Sex,Improved)
```

These can also be plotted 

```{r}
count(Arthritis,Sex,Improved) %>% 
ggplot(aes(x = Sex, fill = Improved, y = n)) + geom_col()
```


```{r}
count(Arthritis,Sex,Improved) %>% 
ggplot(aes(x = Sex, fill = Improved, y = n)) + geom_col(position = "dodge")
```

The `freq_table` function can be used to calculate the proportions. However, we need to pay attention to the order in which the variables are specified as this will dictate how the proportions are calculated

Compare the output of:-

```{r}
freq_table(Arthritis,Sex, Improved)
```

to:- 

```{r}
freq_table(Arthritis,Improved, Sex)
```


Having explored our data, we can now perform statistical testing. The function `chisq_test` can be used to assess whether differences in proportions are significant or not. We actually don't need to calculate the proportions; R will do this for us.

However, we need to re-format the data slightly into a *wide* table rather than the default *long* nature of a data frame in the `tidyverse`. We the `pivot_wider` function we create a two-by-two table that is typically used for a contingency analysis.

```{r}
Arthritis %>% 
  count(Improved, Treatment) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved)

```

And now remove the `Treatment` column as the `chisq_test` function is only expecting numeric data.

The results are presented in a `tidyverse` tibble that and we can interpret the test statistics and p-value from this table

```{r}
Arthritis %>% 
  count(Improved, Treatment) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved) %>% 
select(-Treatment)  %>% 
  chisq_test()
```

However, the `chisq_test` function is not appropriate in all circumstances.

```{r}
Arthritis %>% 
  count(Improved, Sex) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved) %>% 
  select(-Sex)  %>% 
  chisq_test
```

The Fisher test is recommended for tables with low numbers of observations (e.g. when more than 20% of cells have expected frequencies < 5)

```{r}
Arthritis %>% 
  count(Improved, Sex) %>% 
  tidyr::pivot_wider(values_from = n,names_from = Improved) %>% 
 select(-Sex)  %>% 
  fisher_test

```

<div class="alert alert-warning">

**Exercise**

1- Read the excel file called `Ex Biostat P1.xlsx` into R (see below for the required code)

2- Use the `counts` function to make a cross-tabulation of Tumour grade and Gender

3- Determine the *percentage* of Grade III tumors within females 

4- Use the appropriate test to check if the tumor grade depends on the gender 

```{r eval=FALSE}
## the readxl package is required to read xls and xlsx files into R
## However, csv and tsv files are recommended to store data

tab <- readxl::read_xlsx("data/EX Biostat P1.xlsx")

```

</div>




# Part II - How to assess normality

We will read some example data to illustrate how one would test for a normally-distributed variable. This property is important as it influences which test we should use.

One of the best ways of displaying data is by using a graph. Graphs can make both simple and complex data easier to understand by making it easier to spot trends and patterns. We can use plots to view the distribution of our data (minimum, maximum, mid-point, spread etc) and to ensure that the values in our dataset seem realistic (e.g. no outliers). Many statistical tests rely on the assumption that the data are normally distributed.

The data for this section are to be found in the file `normal_example.csv` in the `data` folder. You will need to specify the file path accordingly.

```{r message=FALSE}
library(readr)
df1 <- read_csv("data/normal_example.csv")
```

We can inspect the data in RStudio and discover that it consists of a tidy dataset with numeric values in a column called `Values` and a column `Var` to indicate a variable name (`x`)

```{r}
View(df1)
```

Various graphical methods are available to assess the distribution. The first of which is a *histogram*. Here, the data are split into "bins" (`ggplot2` chooses the number of bins) and the value on the `y` axis corresponds to the number of observations in that bin. The user only has to specify the variable to be plotted, and `ggplot2` takes care of the binning. From this plot we can judge what the average value of the data is, and the spread.

Histograms can be made in `ggplot2` by using the `geom_hist` function. Here we are going to make use of the `gghistogram` function from `ggpubr`.

```{r}
gghistogram(df1,x="Value")
```


A similar option is to produce a density curve with `ggdensity` (also from `ggpubr`). Here the y-axis is the *density* of a particular value.

```{r}
ggdensity(df1, x="Value")
```

### Combining Histograms and Density 

When assessing the distribution of a variable, you might be tempted to plot the histogram and density on the same plot. `gghistogram` has the argument `add_density`, which should do the job.

```{r}
gghistogram(df1,x="Value",add_density = TRUE)
```

However, this doesn't work. If you check the y-axis limits on the histogram and density plots, you'll notice they are on a different scale. Conveniently there is an argument in `gghistogram` that solves this issue.


```{r}
gghistogram(df1,x="Value",
            add_density = TRUE,
            y="..density..")
```

We can add the *standard* normal curve with the following:-

```{r}
gghistogram(df1,x="Value",add_density = TRUE,y="..density..") + stat_overlay_normal_density(col="steelblue",lwd=2,lty=2)
```




A box plot is an excellent way of displaying continuous data when you are interested in the spread of your data. The box of the box plot corresponds to the lower and upper quartiles of the respective observations and the bar within the box, the median. The whiskers of the box plot correspond to the distance between the lower/upper quartile and the smaller of: the smaller/largest measurement *OR* 1.5 times the inter quartile range. A disadvantage of the box plot is that you don’t see the exact data points. However, box plots are very useful in large datasets where plotting all of the data may give an unclear picture of the shape of your data.

```{r}
## setting x to be empty is required to make a boxplot from one variable
ggboxplot(df1, y = "Value")
```

A *violin plot* is sometimes used in conjunction with the boxplot to show density information.

```{r}
ggviolin(df1, y = "Var")
```

Individual points can also be added with `geom_jitter`; avoiding over-plotting by adding random noise along the x-axis.

```{r}
ggviolin(df1, y = "Value",add = "jitter")
```


Finally, we have a "*qq-plot*" which allows to compare the quantiles of our dataset against a theoretical normal distribution. If the majority of points lie on a diagonal line then the data are approximately normal.

```{r}
ggqqplot(df1, x="Value")
```

These graphical methods are by far the easiest way to assess if a given dataset is normally-distributed.

For "real-life" data, the results are unlikely to give a perfect plot, so some degree of judgement and prior experience with the data type are required.  Indeed, it should be noted that the dataset visualised in the above plots was sampled from a normal distribution. Even then, the plots were not 100% convincing!

## Tests for normality

Although their usage is contentious amongst statisticians, there are a few methods for testing whether variables are normally-distributed or not. If the p-value is sufficiently small then we conclude that the data are not normally distributed. However, some statisticians prefer to use graphical methods and their intuition about the data or prior knowledge of the data type (e.g. some measures are generally believed to be normally-distributed)

```{r}

#shapiro test
#p<0.05 ...difference between data and normality..data not normal
#p>0.05 ...no diff between data and normality ..data normally distributed

df1 %>% 
shapiro_test(Value) %>% 
  mutate(p < 0.05)

```

## Descriptive Statistics

In the [accompanying R course](http://sbc.shef.ac.uk/r-crash-course/) we have seen how to produce summary statistics of columns in a dataset. For a dataset that is normally-distributed, appropriate measures of the average and variability are the *mean* and *standard deviation*. Both these functions are available within R and can be used in conjunction with the `summarise` function in `dplyr`.

```{r}
df1 %>% 
  get_summary_stats 
```

```{r}
df1 %>% 
  get_summary_stats %>% 
  select(mean, sd)
```


<div class="alert alert-warning">
**Exercise**

1- Read the excel file called `Ex Biostat P2.xlsx` into R 

2- Identify if the age and hospitalization days are normally distributed

3- Use the appropriate descriptive statistics [mean and SD,  or median and IQR] for each variable

</div>



# Part III - Significance tests for continuous variables

In this part we will show how to perform tests to compare 1, 2 (or more) continuous variables. The dataset, provided by MASH at The University of Sheffield, describes individuals that have been following different diets and their age and gender. The main goal of interest is to determine which of three competing diet regimes results in the greatest weight loss. However, we can use the dataset to demonstrate other types of test.

```{r message=FALSE}
diet <- read_csv("data/diet.csv")
diet
```



## One-sample test

The first hypothesis we will test is whether the people in the study are overweight or not. This first involves some manipulation of the table to calculate an extra variable; the Body Mass Index (BMI). We will test if people in our study are overweight, where overweight is defined as having a BMI over 25.

$BMI = weight  / height^2$

(*where the weight is measured in kg, and the height in metres*)

<div class="alert alert-warning">
**Exercise**

- Add a new variable to the data frame for the BMI of each person 
    + you might want to do this in multiple steps using the `%>%` notation
</div>

```{r echo=FALSE}
diet <- diet %>% 
  mutate(BMI =initial.weight / (height/100)^2)

```

We can now test our new variable for normality using the plots and tests from earlier

```{r eval=FALSE}
gghistogram(diet, x = "BMI", ,add_density = TRUE,y="..density..")+ stat_overlay_normal_density(col="steelblue")

```

The one-sample t-test is implemented in the function `t_test` (as are the various types of t-test that we will see). The variables to be used in the test are defined using R's formula `~` syntax. 

`Y ~ X`

Where `Y` is a *numeric* variable and `X` is a categorical variable indicating the 

In the case of a one-sample test we are testing one numeric variable against a know or population mean. This is written as:-

`Y ~ 1`

Note that this not the same as testing against a population mean of 1! `t_test` will provide us with a *t* test statistic and a p-value. **However, it is up to us to interpret the p-value according**. There is no guarentee that the test we are conducting is appropriate.

```{r}
diet %>% 
  t_test(BMI ~1)
```

We get a hugely significant result! However, if we look at the description for `t_test` it is testing against a population mean of $0$. It is no surprise that we get a significant result! By changing the `mu` argument we can perform a test to see if the people in the study are overweight to begin with (using 25 as the population or known mean).

```{r}
diet %>% 
  t_test(BMI ~1,mu = 25,alternative = "greater")
## 25 is the cutoff for overweight
#t.test(diet$BMI, mu=25,alternative = "greater")
```


##  Two-sample tests

A two-sample t-test should be used if you want to compare the measurements of two populations. There are two types of two-sample t-test: independent (unpaired) and paired (dependent). To make the correct choice, you need to understand your underlying data. 

- An independent two-sample t-test is used when the two samples are independent of each other, e.g. *comparing the mean response of two groups of patients on treatment vs. control in a clinical trial*. 

- As the name suggests,a paired two-sample t-test is used when the two samples are paired, e.g. *comparing the mean blood pressure of patients before and after treatment* (two measurements per patient).

Back to our dataset, we might wonder if there is actually any effect due to diet so we will compare the intial and final weights.

```{r}
diet %>% 
  select(contains("weight")) 
```
We will not go through all the plotting options from before, but here is the histogram and qqplot for the initial weight.

```{r}
diet %>% 
  select(contains("weight")) %>% 
  gghistogram(x = "initial.weight", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="red")
```

```{r}
diet %>% 
  select(contains("weight")) %>% 
  ggqqplot(x = "initial.weight")
```
<div class="exercise">
**Exercise:** Do you think the initial weight variable is normally-distributed? Vote on wooclap now!
</div>


## Re-formating the dataset for tidy analysis

The dataset in it's current form is not suitable for analysis with tidy methods. Before proceeding we need to re-format the data into two columns. 

- One column containing numeric variables (each value of weight)
- An indicator (or categorical) variable to denote that group each numeric observation belongs to (initial or final weight)

The `pivot_longer` function supports this transformation. As we want to use all columns in our existing dataset we have to use the `everything()` shortcut to select the columns. The names of our new columns can be chosen, but we will chose `Y` and `X` to be consistent with the formula notation introduced earlier.

```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") 
```

So now `Y` contains all our observations of weight and `X` indicates when the weights were measured. The `gghistogram` function we introduced earlier is able to visualise data in this form, and we can use the `facet.by` argument to produce separate plots for each type of weight measurement.

```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  gghistogram(x = "Y", facet.by = "X",add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="red")
```

Similarly, introducing a `group_by` function will allow us to create summaries for each group separately.

```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  group_by(X) %>% 
  get_summary_stats()
```


```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  group_by(X) %>% 
  shapiro_test(Y)
```
The `t_test` function requires us to create a *formula*


```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  t_test(Y ~ X)
```


```{r}
## Just to prove that we get the same result as base R
## Will remove before the course
t.test(diet$final.weight, diet$initial.weight)
```

The p-value is significant and shows that overall the weights of individuals is *different* before and after diet. However, this test is not specifically testing for a *decrease* after the diet (which we would really hope to be the case). By adding an extra argument `alternative=less` we get a different result

```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Y",names_to = "X") %>% 
  t_test(Y ~ X,alternative = "less")
```


```{r}
t.test(diet$final.weight, diet$initial.weight,alternative = "less")
```

Easy way of showing stats result on the plot is using `stat_compare_means`

```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Weight",names_to = "Time") %>% 
  ggplot(aes(x = Time, y = Weight)) + geom_boxplot() + geom_jitter(width=0.1) + stat_compare_means(method = "t.test",method.args = list(alternative = "less"))
```


There is also extra information that we could employ; namely that the measurements of weight are made **on the same person** before and after dieting. This is a classic example of when to apply a paired t-test. Again, we do not need to use a different function to perform the test; only add an argument `paired=TRUE` to `t_test`.

```{r}
diet %>% 
  select(contains("weight")) %>% 
  tidyr::pivot_longer(everything(),values_to = "Weight",names_to = "Time") %>% 
  t_test(Weight ~ Time,alternative = "less",paired=TRUE)
```


```{r}
t.test(diet$final.weight, diet$initial.weight,alternative = "less", paired=TRUE)
```

A convenient plot in `ggpubr` will allow us to visualise the paired differences

```{r}
ggpaired(diet, cond1="initial.weight", cond2="final.weight",line.color = "grey") + stat_compare_means(paired=TRUE,method = "t.test")
```


## The Independent t test, with two independent groups 

Lets consider that we want to compare whether males or females lost more weight during the trial. Here we have two groups and these can be treated as *independant* variables as different patients belong to the two groups.

The *null hypothesis* for such a test would be that the weight loss is the same between groups male and female. We seek to evidence to reject this hypothesis by calculating a test statistic. Firstly, we have to check for normality:-

```{r}
diet %>% 
mutate(weight.loss = initial.weight - final.weight) %>% 
ggdensity("weight.loss",color = "gender") + stat_overlay_normal_density()

diet %>%  
  mutate(weight.loss = initial.weight - final.weight) %>% 
  group_by(gender) %>% 
  shapiro_test(weight.loss)
  
```

We conclude that the variances are approximately the same and both variables are normally-distributed


We can use the `t_test` function to perform an *independant* test. As before, the first argument to `t_test` is the *R formula* notation for the test being performed. However, this time we do not need the `pivot_longer` function

This function allows various type of test to be performed by changing the appropriate arguments (see the help for `t.test` for details (`?t.test`)). For instance, we can tell the test that we believe our variances are equal or not.


```{r}
diet %>% 
  mutate(weight.loss = initial.weight - final.weight) %>% 
  t_test(weight.loss ~ gender)
```


We can see that the t-statistic we observe is consistent with the null hypothesis, that the weight loss in males and females is the same. That is, the probability of observing a t-statistic of 0.2 or more, or -0.2 or less, is quite high.

*This is not a significant result (p>0.05), so there is no evidence of a difference in weight loss between males and females*

```{r}
diet %>% 
  mutate(weight.loss = initial.weight - final.weight) %>% 
  ggplot(aes(x = gender, y=weight.loss)) + geom_boxplot() + stat_compare_means(method="t.test")
```

## Non- Parametric alternatives (e.g. the Wilcoxon test)

Being able to use the `t_test` relies on the your data being normally-distributed. If we do not sufficient confidence in this assumption, there are different statistical tests that can be applied. Rather than calculating and comparing the *means* and *variances* of different groups they are *rank-based* methods. However, they still come with a set of assumptions and involve the generation of test statistics and p-values.

### Independent samples = Wilcoxon rank sum test (Mann Whitney U test)

This test has many different names including the Wilcoxon, Wilcoxon two sample test, Wilcoxon-Mann-Whitney, Wilcoxon rank sum and the Mann-Whitney-U test. However, this test should not be confused with the Wilcoxon signed rank test (which is used for paired tests). To avoid confusion this test is usually referred to as the Mann-Whitney U test, which is used when the dependent variable to be examined is continuous but the assumptions for parametric tests are violated.

The assumptions of the Mann-Whitney U are as follows:

1.The dependent variable is ordinal or continuous.

2.The data consist of a randomly selected sample of independent observations from two independent groups.

3.The dependent variables for the two independent groups share a similar shape.

Fortunately, the R developers have made the function to do a Wilcox-test similar to doing a t-test. **The difficulty is in choosing the correct test to apply - which R will not advise you on**.

Let's go back to our example of comparing weight loss between groups male and female. The equivalent non-parametric version of the test we performed before is:-

```{r}
diet %>% 
  mutate(weight.loss = initial.weight - final.weight) %>% 
  wilcox_test(weight.loss ~ gender)
```

The `wilcox_test` is flexible in much the same way that `t_test` in. We can switch to applying a paired test by adding the argument `paired=TRUE`.

```{r}
diet %>% 
  select(initial.weight,final.weight) %>% 
  tidyr::pivot_longer(everything(),values_to = "Weight",names_to = "Time") %>% 
  wilcox_test(Weight ~ Time, paired=TRUE)
```


## Compare between *more than two* groups

#### Parametric (ANOVA)

The two-sample t-test is useful when we have just two groups of continuous data to compare. When we want to compare more than two groups, a one-way ANOVA can be used to simultaneously compare all groups, rather than carrying out several individual two-sample t-tests.  The main advantage of doing this is that it reduces the number of tests being carried out, meaning that the type I error rate (the probability of seeing a significant result just by chance) does not become inflated. 

In order to justify if an ANOVA test is appropriate we have to test for normality.

```{r}
diet <- mutate(diet, weight.loss = initial.weight - final.weight)

diet %>%   
gghistogram(x = "weight.loss", facet.by = "diet.type", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="red")

group_by(diet, diet.type) %>% 
  shapiro_test(weight.loss)


diet %>% 
  ggqqplot(x = "weight.loss", facet.by = "diet.type")

```

A one-way ANOVA compares group means by partitioning the variation in the data into *between group variance* and *within group variance*. Like the other statistical tests we have encountered, the functions in R do the hard work of calculating the statistics. The `anova_test` function is a tidy version of the ANOVA test.

```{r}
diet %>% 
  anova_test(weight.loss ~ diet.type)
```


When the test provides a significant result (like above) it tells us that there is at least on difference in the groups. However, it does not tell us which group is different. For this, we can apply a "post-hoc test" such as the Tukey test. If `anova_test` did not produce a significant p-value, we wouldn't proceed with this step

```{r}
diet %>%  
  tukey_hsd(weight.loss ~ diet.type)
```

As we have seen previously, a standard method of presenting the differences between groups is to use the `stat_compare_means` function to automatically add p-values to a boxplot or violin plot

```{r}
ggplot(diet, aes(x = diet.type, y = weight.loss)) + geom_violin() + geom_jitter(width=0.1) + stat_compare_means(method="anova")
```
However, in the case of more than two groups it will only show a single p-value from the ANOVA rather than individual comparisons. We can explicitly list particular contrasts we are interested in.

```{r}
my_comparisons <- list( c("A", "B"), c("A", "C"), c("B", "C") )

ggplot(diet, aes(x = diet.type, y = weight.loss)) + geom_violin() + geom_jitter() +  stat_compare_means(method = "t.test",comparisons = my_comparisons)
```
Alternatively, we can manually-compute the p-values and add these to the plot.

```{r}
stat_res <- diet %>% 
  tukey_hsd(weight.loss ~ diet.type)

ggplot(diet, aes(x = diet.type, y = weight.loss)) + geom_violin() + stat_pvalue_manual(stat_res, label = "p.adj",y.position = c(11, 13, 15))
```



### Non-Parametric (Kruskal Wallis) 

Data that do not meet the assumptions of ANOVA (e.g. normality) can be tested using a non-parametric alternative. The *Kruskal-Wallis* test is derived from the one-way ANOVA, but uses ranks rather than actual observations. It is also the extension of the Mann-Whitney U test to greater than two groups.

```{r}
diet %>% 
  kruskal_test(weight.loss ~ diet.type)
```


```{r}
kruskal.test(weight.loss ~ as.factor(diet.type), data=diet)
```

Like the one-way ANOVA this will only tell us that at least one group is different and not specifically which group(s). The post-hoc `dunn.test` is recommended which also performs multiple testing correction.

```{r}
diet %>% 
  dunn_test(weight.loss ~ diet.type)
```


At this point we could be about to recommend diet C to those that wish to lose weight. However, are there any other factors in the data that we should be considering? With `ggplot2` we can quite easily visualise the effects of multiple factors on the data. Lets add both gender and diet type into the plot. It now appears that diet C is having an effect on males but not females.

```{r}
ggplot(diet, aes(x = diet.type, y = weight.loss,fill=gender)) + geom_boxplot()
```

## Two-way ANOVA

The *formula* notation allows us to specify an *interaction* between gender and diet type. In other words, we are looking to see if the effect of diet type is different for males and females. In R, the formula for an interaction is specified using a `*` between the variables that we are interested in assessing the interaction for.

```{r}
diet %>% 
  anova_test(weight.loss ~ diet.type*gender)
```


This tells us that an effect exists between diet type and gender, but like before we have to run a post-hoc test to discover more

```{r}
diet %>% 
  tukey_hsd(weight.loss ~ diet.type*gender)
```






## Repeated Measures


**CHECK THIS CODE DOES WHAT WE WANT**

We will read a modified version of the diet dataset in order to test a "repeated measures" analysis. Here we have added a midpoint weight measurement.

```{r}
diet2 <- read_csv("data/diet2.csv")
View(diet2)
```

However, the three measures that we want to compare are given as columns in the data frame. We cannot express this using the R `~` notation. In other words the data is in *wide* format and not *long*. We can change this using the `tidyr` package. This creates a variable for the type of measurement (`initial`, `mid` or `final)` and a variable containing the corresponding value. The default column names are `name` and `value` respectively.

```{r}
### With "::" you can use a function from a package you haven't loaded yet

diet_melt <- diet %>% select(id,gender, contains("weight")) %>% 
  tidyr::pivot_longer(3:5, values_to = "value", names_to = "time_point") %>% 
  mutate(id = as.factor(id))
```

```{r}
anova_test(diet_melt, dv =value, 
           wid = id, 
           within = time_point)
```




<div class="alert alert-warning">
**Exercise**

The excel file ‘RCC2’ contains data about the expression levels of some genes in patients with renal cell carcinoma. In your study, you put the following hypotheses.
Please test those alternative hypotheses and state whether you will accept or reject each one.

1.	Females have a higher level of E2F3 than males 
2.	ANXA expression levels vary between unilateral and bilateral tumors
3.	Individuals with RCC grade II have different levels of E2F3 than those with grade III or IV 
4.	The mean value of miR499 decreases significantly after treatment
5.	DFFA is higher in patients with grade IV tumors
</div>

# Solutions

**To be revealed during the workshop!**

## Exercise 1

```{r}
###OLD

##1-Read the excel file called 'EX Biostat P1' into R 
Biostat1 <- readxl::read_xlsx("data/EX Biostat P1.xlsx")

##2-Make a cross table showing the gender in the rows and tumor grade in the columns
GenderGrade <- table(Biostat1$Gender, Biostat1$`Tumor grade`)
GenderGrade

##3-Define the percentage of Grade III tumors within females 
prop.table(GenderGrade,1)*100 #(42.3%)

##4-Add a column in the table showing the sum of the 3 grades in males and in females and state the total number of males and females in the sample 
addmargins(GenderGrade,2) #(71 Females, 59 males)

##5-Use the appropriate test to check if the tumor grade depends on the gender 
chisq.test(GenderGrade) #grade depends on gender (significant)
```
```{r}
## NEW

Biostat1 <- readxl::read_xlsx("data/EX Biostat P1.xlsx")
##2-Make a cross table showing the gender in the rows and tumor grade in the columns

count(Biostat1, Gender, `Tumor grade`)


##3-Define the percentage of Grade III tumors within females 
freq_table(Biostat1, Gender, `Tumor grade`)

##4-Use the appropriate test to check if the tumor grade depends on the gender 

count(Biostat1, Gender, `Tumor grade`) %>% 
    tidyr::pivot_wider(everything(), names_from = "Gender",values_from = "n") %>% 
  select(-`Tumor grade`) %>% 
  chisq_test()

```

## Exercise 2


```{r}

## OLD

##1-Read the excel file called 'EX Biostat P2' into R 
Biostat2 <- readxl::read_xlsx("data/EX Biostat P2.xlsx")

hist(Biostat2$Age,freq = FALSE)
lines(density(Biostat2$Age))

##2-Identify if the age and hospitalization days are normally distributed
library(ggplot2)
ggplot(Biostat2, aes(x=Age)) + geom_histogram(aes(y=..density..,),binwidth = 2) + geom_density()
ggplot(Biostat2, aes(x="",y=Age))  + geom_violin() + geom_boxplot()
ggplot(Biostat2,aes(sample=Age)) + geom_qq() + geom_qq_line()
shapiro.test(Biostat2$Age)
# Age normally distributed


ggplot(Biostat2, aes(x=`hospitalization days`)) + geom_histogram(aes(y=..density..)) + geom_density()
ggplot(Biostat2, aes(x="",y=`hospitalization days`))  + geom_violin() + geom_boxplot()
ggplot(Biostat2,aes(sample=`hospitalization days`)) + geom_qq() + geom_qq_line()
shapiro.test(Biostat2$`hospitalization days`)

# hospitalization days NOT normally distributed

##3-Use the appropriate descriptive statistics [mean+-SD or median (IQ range)] for each variable
library(tidyverse)
summarise(Biostat2, MeanAge = mean(Age), sdAge = sd(Age), 
          medianHospDays=median(`hospitalization days`), 
          Q1=quantile(`hospitalization days`, 0.25), 
          Q3=quantile(`hospitalization days`, 0.75))
#Age: 9.54 +/- 2.89 [mean +- SD]  & Hospitalization days: 11.5 (9.00 - 14.25) [median(IQ range)]
```

```{r}
###NEW

##1-Read the excel file called 'EX Biostat P2' into R 
Biostat2 <- readxl::read_xlsx("data/EX Biostat P2.xlsx")

gghistogram(Biostat2, x = "Age")

gghistogram(Biostat2, x = "Age", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="blue")


ggqqplot(Biostat2, x = "Age")
Biostat2 %>% 
  shapiro_test(vars = "Age")
# Age is normally distributed

gghistogram(Biostat2, x = "hospitalization days")

gghistogram(Biostat2, x = "hospitalization days", add_density = TRUE, y = "..density..") + stat_overlay_normal_density(col="blue")


ggqqplot(Biostat2, x = "hospitalization days")
Biostat2 %>% 
  shapiro_test(vars = "hospitalization days")

# hospitalization days NOT normally distributed

get_summary_stats(Biostat2)

```

